Abstract

Consider a possibly non-linear $(n, K, d)_q$ code. Coordinate $i$ has locality $r$ if its
value is determined by some $r$ other coordinates. A recent line of work obtained an
optimal trade-off between information locality of codes and their redundancy. Further,
for linear codes meeting this trade-off, structure theorems were derived. In this work
we give a new proof of the locality / redundancy trade-off and generalize structure
theorems to non-linear codes.

1 Introduction 

We say that a certain coordinate of an error-correcting code has locality $r$ if, when erased,
the value at this coordinate can be recovered by accessing at most $r$ other coordinates.
Motivated by applications to data storage [HSX+12], the authors of [GHSY12] introduced
$(r, d)$-codes, which are systematic codes that have distance $d$ and thus tolerate up to $d - 1$
erasures, but also have the property that any information coordinate has locality $r$ or less.
They established that in all linear $[n, k, d]_q$ codes with the $(r, d)$-property

$$n \ge k + \left\lceil \frac{k}{r} \right\rceil + d - 2. \tag{1.1}$$

In what follows we refer to codes that meet (1.1) with equality as optimal. A construction
of [HCL07] implies that optimal codes exist for all values of parameters. In the natural
setting of $r \mid k$, the lower-bound argument of [GHSY12] yields structure theorems for optimal
linear codes. These theorems are particularly strong when $d < r + 3$. In particular, in
that case they imply tight lower bounds for the locality of parity coordinates. The results
of [GHSY12] were recently extended by [PD12, SAP+13], who generalized the inequality (1.1)
(but not the structure theorems) to non-linear codes.

In this paper, we further extend the line of work above. We first give a new proof of
the lower bound (1.1) for non-linear codes. Then, as in [GHSY12], we use the lower-bound
argument to derive structure theorems for optimal non-linear codes.
The main technical problem that we have to address in going from the lower bound to
structure theorems is that of reversibility of local constraints. In linear codes, any local
constraint on coordinates in the code must be a linear constraint, and linear constraints are
trivially reversible, in that knowing all but one coordinate in the constraint always determines
that coordinate, regardless of which coordinate is missing. However, for non-linear codes
it is possible to have local constraints that are not reversible. For example, it is possible
for the coordinates $\{i, i'\}$ to determine the coordinate $i''$, but for the coordinates $\{i', i''\}$ to
not determine the coordinate $i$. However, we show that for optimal $(r, d)$-codes, even in
the non-linear case, all locality constraints must be reversible. Once this is established, the
structural results of [GHSY12] can then be extended to the non-linear case.
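The distinction between reversible and non-reversible constraints can be illustrated on a toy example. The sketch below is only an illustration: the two length-3 binary codes and the helper name `determines` are assumptions made here, and coordinates are 0-based in the code.

```python
from itertools import product

def determines(code, S, i):
    """Return True if the symbols at coordinates S fix the symbol at
    coordinate i across every codeword of `code` (0-based coordinates)."""
    seen = {}
    for x in code:
        key = tuple(x[s] for s in S)
        # Two codewords agreeing on S but differing at i break the constraint.
        if seen.setdefault(key, x[i]) != x[i]:
            return False
    return True

# Hypothetical non-linear binary code: the third symbol is the AND of the first two.
and_code = [(a, b, a & b) for a, b in product((0, 1), repeat=2)]
# Linear counterpart: the third symbol is the XOR (sum mod 2) of the first two.
xor_code = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

print(determines(and_code, (0, 1), 2))  # forward direction of the AND constraint
print(determines(and_code, (1, 2), 0))  # the same constraint, reversed
print(determines(xor_code, (1, 2), 0))  # reversal in the linear code
```

In the AND code the first two coordinates determine the third, but the last two do not determine the first, whereas the XOR constraint can be solved for any of its three coordinates.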

2 Preliminaries 

We will first fix some notation, then define the objects we will be considering. 

2.1 Notation 

Throughout, we consider codes which may be non-linear over an arbitrary alphabet $\Sigma$, where
$|\Sigma| = q \ge 2$ is an arbitrary integer. Given two vectors $x, y \in \Sigma^n$, $\Delta(x, y)$ will denote the
unnormalized Hamming distance between $x$ and $y$. For $S \subseteq [n]$, we will write $x|_S$ for the sequence
of symbols in $x$ with coordinates in $S$. When $S = \{i\}$ we will just write $x|_i$. For an integer
$n \ge 0$, $[n]$ denotes the set $\{1, \ldots, n\}$, where $[0]$ will be understood as the empty set. For
disjoint sets $A$ and $B$, we write $A \sqcup B$ to denote their disjoint union.
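The notation above translates directly into a few helper functions. This is a minimal sketch, assuming codewords are represented as Python tuples; the names `hamming`, `restrict` and `bracket` are hypothetical and chosen here for illustration.

```python
def hamming(x, y):
    """Unnormalized Hamming distance: number of coordinates where x and y differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(x, y))

def restrict(x, S):
    """x|_S: the symbols of x at the (1-based) coordinates in S, in increasing order."""
    return tuple(x[i - 1] for i in sorted(S))

def bracket(n):
    """[n] = {1, ..., n}; [0] is the empty set."""
    return set(range(1, n + 1))

x, y = (0, 1, 2, 1), (0, 2, 2, 0)
print(hamming(x, y))        # x and y differ in coordinates 2 and 4
print(restrict(x, {4, 2}))  # symbols of x at coordinates 2 and 4
print(bracket(0))           # the empty set
```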

2.2 Definitions 

Recall the definition of a code, which we do not assume to be linear. 

Definition 2.1. An $(n, K, d)_q$ code is a subset $C \subseteq \Sigma^n$ with size $|C| = K$, such that for any
$x \ne y \in C$, $\Delta(x, y) \ge d$. If $C' \subseteq C$ then $C'$ is a sub-code of $C$. The parameter $n$ will be referred
to as the block-length, $k = \log_q K$ the dimension and $d$ the distance.

The code is systematic if $k \in \mathbb{Z}$, and there is an encoding function $\mathrm{Enc} : \Sigma^k \to \Sigma^n$ such
that for $x \in \Sigma^k$, $\mathrm{Enc}(x)|_i = x|_i$, for $i \in [k]$.

A systematic code takes on all values in its first $k$ coordinates, and the values of these
coordinates determine the rest of the codeword. The first $k$ coordinates of the codewords are
thus referred to as the information symbols; the other coordinates will be called parity symbols.
This work is concerned with codes that have local constraints on the information symbols.
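As a concrete instance of these definitions, the sketch below builds a small systematic code and computes its parameters by brute force. The encoder `enc` (two information bits plus one parity bit) and the helper names are assumptions made for illustration, not objects from the text.

```python
from itertools import combinations, product

def min_distance(code):
    """Smallest Hamming distance between two distinct codewords."""
    return min(sum(a != b for a, b in zip(x, y)) for x, y in combinations(code, 2))

def is_systematic(code, k, q):
    """True if the first k symbols range over all q^k values, each exactly once,
    so that they determine the rest of the codeword."""
    prefixes = {x[:k] for x in code}
    return len(code) == q ** k and len(prefixes) == q ** k

def enc(msg):
    """Hypothetical systematic encoder: append one parity symbol (sum mod 2)."""
    return tuple(msg) + (sum(msg) % 2,)

q, k = 2, 2
code = [enc(m) for m in product(range(q), repeat=k)]
n, K, d = len(code[0]), len(code), min_distance(code)
print((n, K, d), is_systematic(code, k, q))  # parameters of a (3, 4, 2)_2 code
```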

Definition 2.2. A systematic $(n, q^k, d)_q$ code has information locality $r$ if for every
$i \in [k]$, there is a size $\le r$ subset $S \subseteq [n] \setminus \{i\}$ such that for any $x \in C$, $x|_i$ is determined by
$x|_S$.

Symbols other than the information symbols can also have locality, and we will
occasionally use this.
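For small codes, locality can be computed by exhaustive search over candidate sets $S$. The following sketch assumes a hypothetical binary code with four information bits and two local parity bits; the function names are illustrative, and the search is exponential in $n$, so it is only meant for toy examples.

```python
from itertools import combinations, product

def determines(code, S, i):
    """True if x|_S fixes x|_i over all codewords (1-based coordinates)."""
    seen = {}
    for x in code:
        key = tuple(x[s - 1] for s in S)
        if seen.setdefault(key, x[i - 1]) != x[i - 1]:
            return False
    return True

def locality(code, i):
    """Smallest r such that some r coordinates other than i determine coordinate i."""
    n = len(code[0])
    others = [c for c in range(1, n + 1) if c != i]
    for r in range(n):
        if any(determines(code, S, i) for S in combinations(others, r)):
            return r
    raise ValueError("coordinate is not determined by the other coordinates")

def information_locality(code, k):
    """Largest locality among the k information coordinates."""
    return max(locality(code, i) for i in range(1, k + 1))

# Hypothetical example: bits (a, b, c, d) followed by parities a+b and c+d (mod 2).
code = [(a, b, c, d, a ^ b, c ^ d) for a, b, c, d in product((0, 1), repeat=4)]
print(information_locality(code, 4))  # locality of the information symbols
print(locality(code, 5))              # locality of the first parity symbol
```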
3 Locality Lower Bounds



In this section we establish lower bounds on the block-length of codes with small information 
locality. We will then prove structural results for codes meeting this lower bound. As we 
will often use it, we now prove the Singleton bound. 

Lemma 3.1 (Singleton Bound). Let $C$ be an $(n, K, d)_q$ code, with $K > 1$. Then,

$$n \ge \log_q K + d - 1.$$

Proof. As $K > 1$, there are at least two distinct codewords. These codewords are at least
$d$ apart. Thus $d \le n$. Therefore we can delete the first $d - 1$ coordinates from each codeword,
resulting in a code $C' \subseteq \Sigma^{n-d+1}$. The new code $C'$ has distance $\ge 1$, as each pair of original
codewords has distance $\ge d$ and we only deleted $d - 1$ coordinates. Thus, we have an
injective map from $C$ to $\Sigma^{n-d+1}$, and thus $\log_q K \le n - d + 1$. □
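The puncturing step of this proof can be replayed numerically. The sketch below is an illustration under assumptions: the ternary code of length 4 is a hypothetical example chosen here, and `singleton_holds` checks the bound in the equivalent integer form $q^{n-d+1} \ge K$.

```python
from itertools import combinations, product

def min_distance(code):
    """Smallest Hamming distance between two distinct codewords."""
    return min(sum(a != b for a, b in zip(x, y)) for x, y in combinations(code, 2))

def puncture_prefix(code, t):
    """Delete the first t coordinates of every codeword."""
    return [x[t:] for x in code]

def singleton_holds(n, K, d, q):
    """Check n >= log_q K + d - 1, written as q^(n-d+1) >= K (assumes d <= n)."""
    return q ** (n - d + 1) >= K

# Hypothetical example over the alphabet {0, 1, 2}: (a, b, a+b, a+2b) mod 3.
q = 3
code = [(a, b, (a + b) % q, (a + 2 * b) % q) for a, b in product(range(q), repeat=2)]
n, K, d = len(code[0]), len(code), min_distance(code)

# Deleting d - 1 coordinates keeps all codewords distinct, as in the proof.
punctured = puncture_prefix(code, d - 1)
print((n, K, d), len(set(punctured)) == K, singleton_holds(n, K, d, q))
```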

The lower bounds we derive for local codes will follow by analyzing Algorithm 1. This
algorithm will use the local constraints of the code to iteratively find large sub-codes $C_j \subseteq
C \subseteq \Sigma^n$. In this process, the (effective) block-length will decrease faster than the dimension
of the sub-codes, while maintaining the distance. Thus, the sub-codes become more optimal
in terms of rate, and eventually we can apply the Singleton bound to bound this process.

Algorithm 1 Finding sub-codes via locality
1: procedure sub-code($C, n, k, d, r, q$)
2:     $C_0 = C$
3:     $j = 0$
4:     while $|C_j| > 1$ do
5:         $j \leftarrow j + 1$.
6:         Choose $i_j$ such that $i_j \notin R_{j-1} := \bigcup_{j' \in [j-1]} (S_{j'} \cup \{i_{j'}\})$ and a (size $\le r$) subset of
           coordinates $S_j \subseteq [n] \setminus \{i_j\}$ determines the coordinate $i_j$, for all $x \in C$.
7:         Let $\sigma_j \in \Sigma^{|S_j|}$ be the most frequent element in the multi-set $\{x|_{S_j} : x \in C_{j-1}\}$.
8:         Define $C_j := \{x : x \in C_{j-1},\ x|_{S_j} = \sigma_j\}$.
9:     end while
10: end procedure
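Algorithm 1 can be sketched in executable form for small codes. The version below makes several choices that the pseudocode leaves open, and these are assumptions of this sketch only: it takes $i_j$ to be the smallest information coordinate outside $R_{j-1}$, finds $S_j$ by brute-force search in order of increasing size, and breaks ties for the most frequent $\sigma_j$ by first occurrence. The function names and the example code are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

def determines(code, S, i):
    """True if x|_S fixes x|_i over all codewords (1-based coordinates)."""
    seen = {}
    for x in code:
        key = tuple(x[s - 1] for s in S)
        if seen.setdefault(key, x[i - 1]) != x[i - 1]:
            return False
    return True

def find_sub_codes(code, k, r):
    """Sketch of Algorithm 1: returns the chosen i_j, the sets S_j, and C_0, C_1, ..."""
    n = len(code[0])
    subs, picks, sets, R = [list(code)], [], [], set()
    while len(subs[-1]) > 1:                                   # Line 4
        cur = subs[-1]
        # Line 6: a free information coordinate and a determining set of size <= r.
        i = next(c for c in range(1, k + 1) if c not in R)
        others = [c for c in range(1, n + 1) if c != i]
        S = next(T for size in range(r + 1)
                 for T in combinations(others, size) if determines(code, T, i))
        # Line 7: most frequent value of x|_S over the current sub-code.
        sigma = Counter(tuple(x[s - 1] for s in S) for x in cur).most_common(1)[0][0]
        # Line 8: keep only the codewords that agree with sigma on S.
        subs.append([x for x in cur if tuple(x[s - 1] for s in S) == sigma])
        picks.append(i)
        sets.append(S)
        R |= set(S) | {i}
    return picks, sets, subs

# Hypothetical example: bits (a, b, c, d) followed by parities a+b and c+d (mod 2).
code = [(a, b, c, d, a ^ b, c ^ d) for a, b, c, d in product((0, 1), repeat=4)]
picks, sets, subs = find_sub_codes(code, k=4, r=2)
print(picks, sets, [len(c) for c in subs])
```

On this example each round fixes one local group, and the sub-code shrinks by a factor of $q^r = 4$ per round until a single codeword remains.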



Theorem 3.2. Let $C$ be a systematic $(n, q^k, d)_q$ code with information locality $r$. Then

$$n \ge k + \left\lceil \frac{k}{r} \right\rceil + d - 2.$$
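The right-hand side of this bound is simple to evaluate. The helper below is a small sketch (the name `min_block_length` is hypothetical) that uses integer arithmetic for the ceiling.

```python
def min_block_length(k, r, d):
    """Right-hand side of the bound: k + ceil(k / r) + d - 2."""
    if not (1 <= r <= k and d >= 1):
        raise ValueError("expected 1 <= r <= k and d >= 1")
    return k + -(-k // r) + d - 2  # -(-k // r) is ceil(k / r) for positive integers

print(min_block_length(4, 2, 2))  # k = 4, r = 2, d = 2
print(min_block_length(5, 2, 3))  # r does not divide k, so the ceiling matters
print(min_block_length(4, 4, 3))  # r = k recovers the Singleton bound k + d - 1
```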



Proof. We first show that Algorithm 1 is well-defined. In particular, in Line 6 such an
$i_j$ exists. For, by hypothesis $|C_{j-1}| > 1$, implying there are $x \ne y \in C_{j-1} \subseteq C$. As $C$ is
systematic, such codewords are determined by their first $k$ coordinates, and thus must differ
on at least one of those $k$ coordinates. Further, they will not differ on coordinates in $R_{j-1}$,
as those coordinates are fixed: we fixed the coordinates in $S_{j'}$ for $j' < j$ by construction
in Line 8, and this also fixes the $\{i_{j'}\}_{j' < j}$ by locality. Thus, any information coordinate where $x$ and $y$
differ will suffice for $i_j$. In Line 6 the set $S_j$ exists by our locality assumption on the code $C$.
Thus, the algorithm is well-defined.

We now analyze the algorithm. We first show that each new sub-code is not too small.
Define $T_j := S_j \setminus R_{j-1}$ and define $t_j := |T_j|$, which is the number of coordinates we fix in
constructing $C_j$ that are not necessarily fixed by prior loops in the procedure. It follows
that there are $\le q^{t_j}$ many possibilities for the $\sigma_j$ in Line 7, and thus $|C_j| \ge |C_{j-1}| / q^{t_j}$ by
averaging.

Given that the sub-codes are not shrinking too fast, we can use this to lower-bound the
number of iterations of the algorithm. Let $\ell$ denote the largest value of $j$ such that $|C_j| > 1$
in the algorithm, and thus $|C_{\ell+1}| = 1$. By the above bound on sub-code size, we see then
that,

$$0 = \log_q |C_{\ell+1}| \ge k - \sum_{j=1}^{\ell+1} t_j$$

and so as $t_j \le |S_j| \le r$,

$$k \le (\ell + 1) r$$

or equivalently,

$$\ell \ge \left\lceil \frac{k}{r} \right\rceil - 1.$$



We now reduce to the Singleton bound. We first note that $R_\ell$ is the disjoint union
$R_\ell = \bigsqcup_{j=1}^{\ell} (T_j \sqcup \{i_j\})$ and thus $|R_\ell| = \ell + \sum_{j=1}^{\ell} t_j$. This decomposition of $R_\ell$ is disjoint as
the $T_j$ are disjoint by construction, the $i_j$ are distinct by construction, and the $i_j$ are disjoint
from the $T_{j'}$: $i_j \notin S_j \supseteq T_j$ by definition of the $S_j$, $i_j \notin T_{j'}$ for $j' > j$ by definition of $T_{j'}$, and
$i_j \notin R_{j'} \supseteq T_{j'}$ for $j' < j$ by definition of $i_j$.

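Now consider the code $C_\ell \subseteq C \subseteq \Sigma^n$, whose codewords all agree on $R_\ell$, by construction.
As it is a sub-code of $C$, $C_\ell$ also has minimum distance $\ge d$. It follows that we can delete the
coordinates $R_\ell \subseteq [n]$ and bijectively map $C_\ell$ to the new code $C' \subseteq \Sigma^{n - |R_\ell|}$, which still has
distance $\ge d$. By the above arguments, we see that

$$\log_q |C'| = \log_q |C_\ell| \ge \log_q |C| - \sum_{j=1}^{\ell} t_j = \log_q |C| - |R_\ell| + \ell.$$

Thus, applying the Singleton bound (Lemma 3.1) to $C'$, we get that,

$$n - |R_\ell| \ge k - |R_\ell| + \ell + d - 1$$

and thus, as $\ell \ge \lceil k/r \rceil - 1$,

$$n \ge k + \ell + d - 1 \ge k + \left\lceil \frac{k}{r} \right\rceil + d - 2. \qquad \square$$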



We now turn to structural results on optimal local codes, that is, codes with information
locality $r$ that meet the above bound of $n = k + \lceil k/r \rceil + d - 2$. For simplicity, we will
restrict ourselves to the case that $r \mid k$ so that $k/r$ is integral. We will first show the local
structure for $r = k$, in which case the code is of parameters $(k + d - 1, q^k, d)_q$ and is thus a
maximum-distance separable (MDS) code.
Lemma 3.3. Let $C$ be a $(k + d - 1, q^k, d)_q$ code, that is, an MDS code. Then for any subset
of coordinates $S \subseteq [k + d - 1]$ of size $k$, the multi-set $\{x|_S : x \in C\}$ takes on all values in $\Sigma^k$
exactly once, and for any $x \in C$, $x|_S$ determines $x$.

Proof. Suppose $x, y \in C$ (possibly equal) have $x|_S = y|_S$. Then as $|S| = k$, it follows that
$\Delta(x, y) \le (k + d - 1) - k = d - 1 < d$. As the minimum distance of any two distinct
codewords in $C$ is $\ge d$, it follows that $x = y$, so $x|_S$ determines $x$.

Thus, as there are $q^k$ codewords, and the values of the coordinates $x|_S$ (taking values in
$\Sigma^k$, and $|\Sigma^k| = q^k$) determine the codeword, it follows that each value in $\Sigma^k$ is taken by $x|_S$
exactly once, ranging over $x \in C$. □
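The conclusion of Lemma 3.3 can be checked exhaustively on small examples. In the sketch below, both codes and the name `every_k_subset_is_bijective` are assumptions chosen for illustration: the first code has parameters $(4, 3^2, 3)_3$ and so is MDS, while the second repeats its two information symbols and is not.

```python
from itertools import combinations, product

def every_k_subset_is_bijective(code, k, q):
    """Check the conclusion of the lemma: on every k-subset of coordinates,
    the codewords take each value of Sigma^k exactly once."""
    n = len(code[0])
    for S in combinations(range(n), k):
        values = {tuple(x[s] for s in S) for x in code}
        if not (len(code) == len(values) == q ** k):
            return False
    return True

q, k = 3, 2
# Hypothetical MDS example with n = k + d - 1 = 4 and d = 3.
mds_code = [(a, b, (a + b) % q, (a + 2 * b) % q) for a, b in product(range(q), repeat=k)]
# Hypothetical non-MDS example of the same length: coordinates 1 and 3 always agree.
repeated = [(a, b, a, b) for a, b in product(range(q), repeat=k)]

print(every_k_subset_is_bijective(mds_code, k, q))
print(every_k_subset_is_bijective(repeated, k, q))
```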

This lemma shows that for each symbol, MDS codes have locality $k$ and no less. When
$r \mid k$ but $r < k$ the situation becomes more complicated, and we derive the following result
(Theorem 3.4). The heart of this result is item (2), which establishes that in non-linear
optimal $(r, d)$-codes, all local constraints will constrain the involved symbols equally, and are
thus reversible. This fact is trivial for linear codes, but more involved for non-linear codes.
Once established, the arguments can follow the case for linear codes as done in [GHSY12].

Theorem 3.4. Let $C$ be a systematic $(n, q^k, d)_q$ code with information locality $r$, with $r \mid k$
and $r < k$. Suppose $n = k + \frac{k}{r} + d - 2$. Let (possibly equal) information coordinates $i, i' \in [k]$
have associated subsets $S \subseteq [n] \setminus \{i\}$ and $S' \subseteq [n] \setminus \{i'\}$ of size $\le r$, such that $x|_S$ determines
$x|_i$ and $x|_{S'}$ determines $x|_{i'}$, for all $x \in C$. Then

1. $|S| = r$.

2. For all $i'' \in S \cup \{i\}$, $x|_{(S \cup \{i\}) \setminus \{i''\}}$ determines $x|_{i''}$, for all $x \in C$.

3. $S \cup \{i\}$ and $S' \cup \{i'\}$ are either equal or disjoint.

4. Up to a permutation of coordinates, $C$ is a code with $k$ information symbols $I$, $k/r$
parities $L$, each depending on a disjoint set of $r$ information symbols, and $d - 2$ other
parities $H$, depending arbitrarily on the $k$ information symbols.
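Items 1 to 3 can be checked by brute force on a small optimal code. The sketch below assumes a hypothetical binary code with $k = 4$, $r = 2$, $d = 2$ and $n = 6 = k + k/r + d - 2$; the helper names are illustrative, and this is a sanity check on one example rather than a proof.

```python
from itertools import combinations, product

def determines(code, S, i):
    """True if x|_S fixes x|_i over all codewords (1-based coordinates)."""
    seen = {}
    for x in code:
        key = tuple(x[s - 1] for s in S)
        if seen.setdefault(key, x[i - 1]) != x[i - 1]:
            return False
    return True

def local_groups(code, k, r):
    """All sets S | {i}, for information coordinates i and determining sets S of size <= r."""
    n = len(code[0])
    groups = set()
    for i in range(1, k + 1):
        others = [c for c in range(1, n + 1) if c != i]
        for size in range(r + 1):
            for S in combinations(others, size):
                if determines(code, S, i):
                    groups.add(frozenset(S) | {i})
    return groups

# Hypothetical optimal example: bits (a, b, c, d) with parities a+b and c+d (mod 2).
k, r = 4, 2
code = [(a, b, c, d, a ^ b, c ^ d) for a, b, c, d in product((0, 1), repeat=k)]
groups = local_groups(code, k, r)

item1 = all(len(G) == r + 1 for G in groups)                                  # |S| = r
item2 = all(determines(code, sorted(G - {c}), c) for G in groups for c in G)  # reversible
item3 = all(G == H or not (G & H) for G in groups for H in groups)            # equal or disjoint
print(sorted(sorted(G) for G in groups), item1, item2, item3)
```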

Proof. The proof will be by analyzing particular runs of Algorithm 1, using the analysis of
Theorem 3.2. By showing that certain inequalities in that analysis must be tight, we will
derive the desired results.

We first establish further properties of the algorithm, in the case that $n = k + k/r + d - 2$,
by extending the analysis given in Theorem 3.2. In particular that analysis shows that

$$n \ge k + \ell + d - 1$$

and

$$k \le \sum_{j=1}^{\ell+1} t_j$$

where $t_j \le |S_j| \le r$. By the hypothesis that $n = k + k/r + d - 2$, we have that $\ell \le k/r - 1 \in \mathbb{Z}$,
from which it follows that

$$\sum_{j=1}^{\ell+1} t_j \le (\ell + 1) r \le k.$$
Combining the above two equations shows that these inequalities, and those inequalities used
to derive them, must be met with equality. In particular, we have that $t_j = |S_j| = r$ for all
$j$, and $\ell = k/r - 1$. From this, we can derive that the subsets $S_j \cup \{i_j\}$ have $r + 1$ distinct
elements, and the family of subsets $\{S_j \cup \{i_j\}\}_j$ are disjoint.

Further, it follows that the inequalities used in the analysis of Theorem 3.2 to establish
that "$n \ge k + \ell + d - 1$" must also be tight. In particular, $|C_j| = |C_{j-1}| / q^r$, for all $j$,
implying that $|C_j| = q^{k - jr}$, for all $j$. By the construction of $C_j$ in Lines 7-8, it follows by
an averaging argument that the multi-set $\{x|_{S_j} : x \in C_{j-1}\}$ has $q^r$ distinct elements (the
maximum possible), each appearing equally often. In particular, this implies that any choice
of $\sigma_j \in \Sigma^r$ in Line 7 is valid, for any choice of the $\{\sigma_{j'}\}_{j' < j}$ in the prior iterations of the loop.
Further, once we have chosen these $k/r = \ell + 1$ values $\sigma_j$, there is a unique codeword
$x \in C$ such that $x|_{S_j} = \sigma_j$ for all $j$, as $|C_{k/r}| = 1$ by construction. It follows then that the
values in the $k$ coordinates $\bigsqcup_j S_j$ are completely independent, take on all $q^k$ possible values,
and uniquely determine a codeword in $C$.

We will now use these facts applied to particular runs of the algorithm. 

(1): Consider Algorithm 1 where we choose $i_1 \leftarrow i$ and $S_1 \leftarrow S$. The choice of $i_1$ is valid,
for we are free to choose any $i_1 \in [n]$, as $R_0 = \emptyset$. The choice of $S_1$ is also valid, as $S$ is a valid
locality constraint on $i_1$. We then continue with Algorithm 1 to define the sets $S_j$ and codes
$C_j$. From the analysis above, it follows that $|S_j| = r$ for all $j$, in particular $|S| = |S_1| = r$.

(2): For each $\tau \in \Sigma^{r-1}$, consider the map $f_\tau : \Sigma \to \Sigma$, such that $f_\tau(x|_{i''}) = x|_i$ for all
$x \in C$ such that $x|_{S \setminus \{i''\}} = \tau$. This is well-defined, as the coordinates $S = (S \setminus \{i''\}) \cup \{i''\}$
determine the coordinate $i$ by locality, and the coordinates in $S$ take on all $q^r$ values by the
analysis above, so for each $\tau$, $f_\tau$ must be defined for each input. The claim will be established
by showing that, for each $\tau$, this map is in fact a bijection, which (as $\Sigma$ is finite) will follow from showing
that it is injective. To do this, we will analyze properties of sub-codes of $C$, which will follow
from analysis of Algorithm 1.

As in (1), we first run Algorithm 1 with $i_1 \leftarrow i$ and $S_1 \leftarrow S$, and the algorithm yields
the $k/r$ coordinates $\{i_j\}_{j \in [k/r]}$ and respective $k/r$ locality constraints $\{S_j\}_{j \in [k/r]}$, such that
the family of $(r + 1)$-sized subsets $\{S_j \cup \{i_j\}\}_j$ are disjoint.

Now observe that choosing the $\{S_j \cup \{i_j\}\}_j$ in reverse order is also a valid run of
Algorithm 1, because the conditions in Line 6 hold regardless of the order of $j$ in the sequence
$\{S_j \cup \{i_j\}\}_j$. Consider the code $C'$ resulting from the algorithm run in reverse order, where
we have fixed the coordinates $\{S_j \cup \{i_j\}\}_{j > 1}$, and so $C'$ is the last code in this reverse-run
algorithm that has more than one codeword. As such, $|C'| = q^r$, by the analysis above. It
has $n - (k/r - 1)(r + 1) = r + d - 1$ non-fixed coordinates, and distance $\ge d$. The non-fixed
coordinates include $i$, $S$, and the remaining $d - 2$ coordinates (see footnote 1). This implies that $C'$ (when
the fixed coordinates are dropped) is an MDS code. In particular, consider the two sets
of coordinates, $S$ and $(S \cup \{i\}) \setminus \{i''\}$, in the code $C'$. By Lemma 3.3, both of these sets of
coordinates take on all possible values in $\Sigma^r$, and determine the other $d - 1$ coordinates in
$C'$.

We now show $f_\tau$ is injective, for any $\tau \in \Sigma^{r-1}$. Suppose $f_\tau(\rho) = f_\tau(\omega) = \sigma$, for some
(possibly equal) $\rho, \omega, \sigma \in \Sigma$. As the coordinates $S$ take on all values in $C'$, there are two
codewords $x, y \in C'$ such that $x|_{S \setminus \{i''\}} = y|_{S \setminus \{i''\}} = \tau$, $x|_{i''} = \rho$ and $y|_{i''} = \omega$. By the locality
in $C$, this means that $x|_i = y|_i = \sigma$. By the locality in $C'$ just established, the coordinate $i''$ is determined
by $x|_{S \setminus \{i''\}} = y|_{S \setminus \{i''\}} = \tau$ and $x|_i = y|_i = \sigma$, and thus $\rho = x|_{i''} = y|_{i''} = \omega$. Thus, $f_\tau$ must be
injective, for every $\tau$.

Footnote 1. Note that $d \ge 2$ is implied here. That $d = 1$ is impossible can be seen most easily in the case when
$r = k$. For then, the parameters imply that the code is simply the list of all $q^k$ words, and so no information
locality is possible, for each coordinate is fully independent of the rest. That $d = 1$ is impossible for $r < k$
can be seen by reducing to a sub-code where $r = k$ via Algorithm 1.

(3): Suppose $S \cup \{i\}$ and $S' \cup \{i'\}$ are not equal, and we seek to show they are disjoint.
Let $i''$ be the index of a coordinate in one of the sets but not the other, and without loss of
generality, $i'' \in (S \cup \{i\}) \setminus (S' \cup \{i'\})$. Define $S'' := (S \cup \{i\}) \setminus \{i''\}$, and observe that (2)
applied to $i'' \in S \cup \{i\}$ implies that the coordinates in $S''$ determine the coordinate $i''$.

Now run Algorithm 1 choosing $i_1 \leftarrow i'$, $S_1 \leftarrow S'$, and $i_2 \leftarrow i''$, $S_2 \leftarrow S''$. The first round
choice of coordinate/locality is clearly valid. That a second round will even be executed
follows from the above analysis, as the code $C_1$ has size $> 1$, using that $r < k$. Further,
the choices of coordinate/locality are valid because $i'' \notin R_1 = S' \cup \{i'\}$ and $S''$ determines $i''$
as established above. Thus, we have a valid initial run of the algorithm, which we can then
run to completion to yield the family $\{S_j \cup \{i_j\}\}_j$. By the above analysis of the algorithm, it
follows that the sets $\{S_j \cup \{i_j\}\}_j$ are disjoint. In particular, $S' \cup \{i'\}$ and $S'' \cup \{i''\} = S \cup \{i\}$
are disjoint, as desired.

(4): The above analysis shows that a run of Algorithm 1 yields the $k/r$ disjoint sets of
coordinates $\{S_j\}_j$ that take on all $q^k$ values and uniquely determine the codeword. Treating
these $k$ coordinates as information symbols, and the corresponding coordinates $\{i_j\}_j$ as the
$k/r$ parities $L$, we get the desired reordering of the coordinates. □

The above analysis shows that the locality constraints are disjoint sets of size $r + 1$, but
does not indicate how many such constraints there are. In the case that $d < r + 3$, we can
show that there are exactly $k/r$ such local constraints, and can characterize their structure.
In fact, just as in the linear case in [GHSY12], we see that the locality structure of optimal
$(r, d)$-codes resembles that of Pyramid codes [HCL07].

Theorem 3.5. Let $C$ be a systematic $(n, q^k, d)_q$ code with information locality $r$, with $r \mid k$
and $r < k$. Suppose $n = k + \frac{k}{r} + d - 2$ and $d < r + 3$. Then the $k/r + d - 2$ parity symbols
can be partitioned into $L$ and $H$, with $|L| = k/r$ and $|H| = d - 2$, where

1. The parities in $L$ each depend on a disjoint subset of size $r$ of the $k$ information
symbols.

2. The parities in $H$ each depend on all of the $k$ information symbols.

3. The parities in $L$ have locality exactly $r$.

4. The parities in $H$ have locality $\ge k - (k/r - 1)(d - 3) \ge r$.
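The bound in item 4 is easy to tabulate. The sketch below evaluates $k - (k/r - 1)(d - 3)$ and confirms, over a small and arbitrarily chosen parameter range, that it does not drop below $r$ when $d < r + 3$. The function name is hypothetical, and the restriction $d \ge 2$ follows the footnote in the proof of Theorem 3.4.

```python
def heavy_parity_locality_bound(k, r, d):
    """Lower bound k - (k/r - 1)(d - 3) on the locality of the parities in H."""
    if k % r != 0 or not r < k or not 2 <= d < r + 3:
        raise ValueError("need r | k, r < k and 2 <= d < r + 3")
    return k - (k // r - 1) * (d - 3)

# Sweep a small, arbitrary parameter range and confirm the bound is at least r.
always_at_least_r = all(
    heavy_parity_locality_bound(k, r, d) >= r
    for r in range(1, 7)
    for k in range(2 * r, 8 * r, r)
    for d in range(2, r + 3)
)

print(heavy_parity_locality_bound(4, 2, 3))  # d = 3: heavy parities need all k symbols
print(heavy_parity_locality_bound(6, 2, 4))
print(always_at_least_r)
```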

Proof. (1): We continue with the analysis given in the proof of Theorem 3.4. In particular,
it implies that the $n$ coordinates carry a family of subsets of size $r + 1$, such that any $r$
of the $r + 1$ coordinates determine the other. By hypothesis, each information coordinate
participates in such a subset, and the analysis in Theorem 3.4 shows that there are at least
$k/r$ such subsets. This leaves at most $n - k/r \cdot (r + 1) = d - 2$ coordinates that do not
participate in these locality constraints, and as $d - 2 < r + 1$, there are in fact exactly $k/r$
disjoint locality constraints and exactly $d - 2$ coordinates not covered by locality constraints.
Thus, there are $k/r \cdot (r + 1) - k = k/r$ parity symbols participating in locality constraints,
and let this set be $L$, which has the desired properties.

(2): We now show that the parities in $H := [n] \setminus ([k] \cup L)$ depend on each of the information
symbols. Consider some information symbol $i \in [k]$, and consider $\alpha \ne \alpha' \in \Sigma$. Let $x, x' \in \Sigma^k$
be such that $x|_i = \alpha$ and $x'|_i = \alpha'$, and $x|_j = x'|_j$ for $i \ne j \in [k]$. It follows then that
$\mathrm{Enc}(x) \ne \mathrm{Enc}(x')$, and these codewords agree in $(k - 1) + k/r - 1$ places: they agree in $k - 1$
information symbols by construction, and they agree in all but one of the $k/r$ light parities
(the light parity grouped with coordinate $i$ being the exception). As the code has minimum
distance $d$, and there are only $d$ coordinates that these distinct codewords can differ on,
it follows that $\mathrm{Enc}(x)$ and $\mathrm{Enc}(x')$ differ on all these coordinates, in particular the $d - 2$
heavy parities $H$. Thus, changing any information coordinate will change all heavy parities,
showing that each coordinate in $H$ depends on each of the $k$ information coordinates.

(3): This follows from Theorem 3.4, as the light parities $L$ can, under a permutation of
coordinates, be regarded as information symbols, and thus cannot have locality $< r$.

(4): Let $[k] = \bigsqcup_{j \in [k/r]} I_j$ be the partition of the information symbols into size $r$ subsets
based on the grouping defined by the light parities. Pick arbitrary $\sigma_j \in \Sigma^r$ for $j \in [k/r]$.
Define the code $C_j \subseteq C$ to be $C_j := \{x : x \in C,\ x|_{I_{j'}} = \sigma_{j'} \text{ for all } j' \in [k/r] \setminus \{j\}\}$. It follows that in
$C_j$, all light parities are fixed except for the $j$-th. Thus, $C_j$ has $(r + 1) + (d - 2) = r + d - 1$
unfixed symbols and has $q^r$ codewords, as all but $r$ information symbols are fixed. As $C_j$ is
a sub-code of $C$, it follows that it has distance $\ge d$ and is thus an MDS code.

Consider any heavy parity $h \in H$ determined by a set of coordinates $S \subseteq [n] \setminus \{h\}$ for all
codewords in $C$, and thus for all codewords in $C_j$ for any $j$. By Lemma 3.3, we see that any
$r$ symbols in $C_j$ are independent, so any locality constraint for $h$ in $C_j$ must involve at least $r$
other symbols. As codewords in $C_j$ are only unfixed on the coordinates $I_j$ and $H$, it follows
that $|S \cap (I_j \sqcup H \setminus \{h\})| \ge r$. Ranging this inequality over all $j \in [k/r]$, and accounting for
double counting over $H \setminus \{h\}$, we see that $|S| \ge k - (k/r - 1)(d - 3)$, as desired. □